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Abstract: We propose a wavelet-based technique for the nonparametric 
estimation of functions contaminated with noise whose mean and variance 
are linked via a possibly unknown variance function. Our method, termed 
the data-driven wavelet-Fisz technique, consists of estimating the variance 
function via a Nadaraya- Watson estimator, and then performing a wavelet 
thresholding procedure which uses the estimated variance function and local 
means of the data to set the thresholds at a suitable level. 

We demonstrate the mean-square near-optimality of our wavelet esti- 
mator over the usual range of Bcsov classes. To achieve this, we establish 
an exponential inequality for the Nadaraya- Watson variance function esti- 
mator. 

We discuss various implementation issues concerning our wavelet esti- 
mator, and demonstrate its good practical performance. We also show how 
it leads to a new wavelet-domain data-driven variance-stabilising transform. 
Our estimator can be applied to a variety of problems, including the esti- 
mation of volatilities, spectral densities and Poisson intensities, as well as 
to a range of problems in which the distribution of the noise is unknown. 
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1. Introduction 

A paradigmatic problem in nonparametric regression is the estimation of a one- 
dimensional function a : [0, 1] i— > R from noisy observations Xt taken on an 
cquispaccd grid: 

Xt ^ a{t/Ti) + e-t, t = l,...,n, (1) 

where the et's are random variables with E(et) = 0. Various subclasses of the 
problem can be identified, depending on the smoothness properties of a and the 
joint distribution of (et)"=i- 

Since the seminal work of Donolio and Johnstone (1994), estimation tech- 
niques based on non-linear wavelet shrinkage have become a commonly used 
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tool in nonparamctric function estimation, extensively studied in the statisti- 
cal literature. Many of them combine excellent finite-sample performance, lin- 
ear computational complexity, and optimal (or near-optimal) asymptotic Mean- 
Square Error behaviour over a variety of function smoothness classes, including 
functions of a low degree of regularity. This often puts them at an advantage 
compared to linear estimation techniques (such as kernel smoothing) in set- 
ups where the function a has discontinuities or exhibits an otherwise irregular 
behaviour. A comprehensive overview of wavelet methods in statistics can be 
found, for example, in Vidakovic (1999). 

The main idea underlying most wavelet techniques is that upon transforming 
the original regression problem (1) via a "multiscale" orthonormal linear trans- 
form W called the Discrete Wavelet Transform (DWT) , the following regression 
formulation is obtained: 

Yj.k ^ fJ.j,k + £j,k, j = 0, . . . ,log2 n - 1, fc=l,...,2-', 

and fc = 1 for j = —1, where j and k are (respectively) scale and location param- 
eters, Yj.fc are the empirical wavelet coefficients of Xt, Hj^k are the true wavelet 
coefficients of a{t/n) which need to be estimated, and £j^k are the wavelet coef- 
ficients of Et- The sequence ^Xj^k is often sparse, with most ^j,fe's being equal or 
close to zero, which motivates the use of simple thresholding techniques that do 
not estimate jij^k by zero if and only if the corresponding Yj ^ exceeds a certain 
threshold in absolute value. This ensures that a large proportion of the noise Sj^k 
gets removed. The inverse DWT then yields an estimate of the original function 
a. 

The overwhelming majority of wavelet-based estimation techniques, such 
as those proposed by Donoho and .Johnstone (1995), Johnstone and Silverman 
(2005a) or Abramovich et al. (2006), to name but a few, were devised under 
the assumption that the errors (et)"=i formed an independent, identically dis- 
tributed Gaussian sequence. This was partly due to the fact that in view of 
the orthonormality of W , the "wavelet noise" £j^k was then also i.i.d. Gaussian, 
which facilitated both the choice of thresholds and the theoretical analysis of the 
resulting estimators. Johnstone and Silverman (1997) proposed an extension of 
the wavelet thresholding paradigm to stationary Gaussian noise. 

In practice, the assumption of Gaussianity is violated in many important 
estimation problems. We list and discuss a selection of them below. 

• Poisson intensity estimation. In Poisson intensity estimation, Xt are mod- 
elled as independent Pois{a(t/n)} variables, which implies that et are 
centered Poisson. The mean and variance of Xt are linked via the rela- 
tionship var(Art) = h{E{Xt)} with h{u) = u. This is in contrast to the i.i.d. 
Gaussian model in which h{u) — const. Recent examples of wavelet-based 
(or otherwise multiscale) Poisson intensity estimation techniques include 
the Bayesian methods of Kolaczyk (1999) and Timmcrmann and Nowak 
(1999), the multiscale likelihood technique of Kolaczyk and Nowak (2004), 
the Haar-Fisz method of Fryzlcwicz and Nason (2004) and the extension 
of the latter proposed by Jansen (2006). Some of the above, as well as 
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some other, techniques are reviewed in Bcsbcas et al. (2004). The work by 
Sardy et aL (2004), amongst other contributions, proposes an automatic 
smoothing procedure for Poisson data. 

• Nonparametric volatility estimation. Nonparametric volatiUty estimation 
techniques are widely used in the finance industry (for example by Risk- 
Metrics, as detailed in their "Technical Document" available from 
http://www.riskmetrics.com/pdf/td4e.pdf). In this sct-up, the Xf^s rep- 
resent squared log-returns on a financial instrument and are modelled as 
independent and distributed as Xt = a{t/n)Zf, where ^{Zf) = 1. Note 
that et = a{t/n){Zf — 1). Thus, the model is multiplicative and the vari- 
ance function h{u) is proportional to u^. A multiscale Haar-Fisz technique 
for nonparametric volatility estimation was proposed by Fryzlewicz et al. 
(2006). 

• Spectral density estimation. In spectral density estimation based on the pe- 
riodogram, the X^'s represent periodogram ordinates and are assumed to 
be asymptotically independent and asymptotically distributed as a{t/n) Z^ , 
where a{t/n) represents the spectral density at frequency t/n, and Zf 
are Exp(l) random variables. This again renders the set-up multiplicative 
and, asymptotically, the variance function takes the form h{u) = u^. Re- 
cent wavelet and multiscale approaches to periodogram smoothing include 
Moulin (1994), Neumann (1996), Gao (1997a), Pcnsky ct al. (2007) and 
Fryzlewicz et al. (2008). 

In the above examples, the variance function h{u) is assumed to be known (as is 
the case in the work of Antoniadis and Sapatinas (2001) and Antoniadis ct al. 
(2001)), and all of the multiscale approaches listed above, in one way or an- 
other, make use of its exact form in order to set the threshold at the "right" 
level: the threshold value usually depends on Var{ej,k), which involves h. How- 
ever, in many estimation problems modelled by (1), it is clear that there exists a 
non-trivial mean- variance relatonship, but its exact form is unknown. Thus, it is 
not a priori clear what threshold values to use. For example, in gene expression 
data observed in microarray experiments, Rocke and Durbin (2001) identified 
that the variance of raw pixel intensities increased with their mean. An inter- 
esting mean- variance relationship arising in solar irradiance data was described 
in Fryzlewicz ct al. (2007). Also, even if the variance function is "assumed to be 
known" , the model which implies its particular form might have been chosen in- 
correctly. Thus, even in such a case it may often be safer to infer the form of the 
variance function from the data in order to set the thresholds at a suitable level. 

A seemingly attractive solution to the problem of the unknown variance 
is to apply a data-driven variance stabilisation technique prior to smoothing 
the transformed data by means of a wavelet-based technique suitable for ho- 
moscedastic noise. Examples of data-driven variance-stabilising transforms in- 
clude the ACE method of Breinian and Friedman (1985), the AVAS technique 
of Tibshirani (1988) as well as the method of Linton ct al. (1997). In the context 
of one-colour microarray data, we mention the generalised log transformation 
proposed by Rocke and Durbin (2001) and the Spread- Versus-Levcl technique 
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of Archer (2004). As these transforms operate on the original data, we refer to 
them as "time-domain" transforms below. 

Despite its appealing modularity, a three-stage estimation procedure which 
involves a time-domain data-driven variance-stabilising transform followed by 
wavelet smoothing of the transformed data and finally the inverse transform, is 
not always recommendable. Firstly, the performance of time-domain data-driven 
variance stabilisation procedures is often less than satisfactory, as illustrated in 
Fryzlewicz ct al. (2007). Secondly, due to the random and often highly nonlinear 
character of such transforms, theoretical properties of the resulting three-stage 
estimator may not be easy to establish. 

Another possible route to follow is to treat the estimation problem (1) as 
an instance of the problem of estimating a function contaminated with locally 
stationary noise. Solutions to the latter were proposed, amongst others, by Gao 
(1997b) and von Sachs and MacGibbon (2000) (see also the references therein). 
However, adapted to our setting, these approaches would mean ignoring the 
fact that the local mean and variance were linked via a variance function h, 
and simply pre- estimating the evolution of the variance of Xt over time. Thus, 
these methods can potentially be suboptimal in our context, as they do not take 
advantage of all available information. 

In this paper, we propose an alternative approach to the problem of esti- 
mating a when h exists but is unknown, which consists of first estimating the 
variance function h, and then constructing a wavelet thresholding estimator of 
a which makes use of the estimate of h. Our method, termed the data-driven 
wavelet-Fisz estimation technique, overcomes the drawbacks mentioned above 
in that (a) it performs well in practice, (b) it only requires that the variance 
function h, and not the target function a, be "smooth" , and (c) the theoretical 
performance of the final estimator of a is possible to quantify and near-optimal. 
In addition, our estimator of a is rapidly computable and easy to implement. 
Its simple modification can also be used in cases in which the variance function 
h is known. Thus, it is applicable to a wide range of problems, including all of 
the examples mentioned above. The added benefit of our approach is that as 
well as estimating a, it also returns an estimate of h, which may be of interest 
to the analyst. 

Our estimator of /i is a simple Nadaraya- Watson estimator, and is inspired by 
(but simpler than) the variance function estimator used by Cliiou and Miiller 
(1999) in a different context. The reason for preferring simplicity here is that in 
order to derive theoretical properties of the resulting estimator of a, we need to 
establish an exponential inequality for the estimator of h. The latter piece of the- 
ory forms a large part of this work, and we hope that it may be of independent 
interest. We note that a large deviation theory for a class of Nadaraya- Watson 
estimators was obtained by Louani (1999) and Joutard (2006). However, it was 
done in a simple nonparametric regression set-up with independent errors, which 
is not applicable in the context of variance function estimation. We also note 
that the theoretical part of our work does not address the automatic choice of 
the smoothing parameter occurring in the estimator of h. However, the simu- 
lations section provides detailed practical recommendations as to the choice of 
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this parameter. This selection is also performed automatically in the software 
package which accompanies this paper. Finally, we note that what we mean by 
the term "variance function estimation" is different from the use of the same 
term by some other authors, for example Cai and Wang (2007) and Wang ct al. 
(2008), who consider the estimation of the variance function as a function of 
time, not as a function of the mean. However, a similar element in our work 
and the above-cited articles is the fact that in both estimation problems, a 
preliminary estimator of the mean is used to estimate the variance function. 

It is interesting to note that the algorithm for computing our estimator of 
a can be decomposed into three separate stages, the first and last of which 
is a particular data-driven variance-stabilising transform, and its inverse, re- 
spectively. In the rare cases when the exact standard deviation of the noise is 
known, the simplest time-domain variance stabilisation procedure consists of 
using this quantity to pre-divide the original regression set-up coordinate- wise, 
as discussed by Gao (1997b). Unlike this and other existing transforms, some 
of which are listed above, our variance-stabilising transform is performed in the 
wavelet domain, as opposed to the time domain. Roughly speaking, the trans- 
form consists of dividing each empirical wavelet coefficient Yj^k by an estimate 
of its own standard deviation, the latter involving the estimate of h. In a non- 
wavelet context, similar ratio statistics (for a known function h) were studied 
by Fisz (1955), which justifies the name of our procedure. 

We also mention that this paper was inspired by our earlier work Motakis ct al. 
(2006) and Fryzlewicz et al. (2007), where we proposed computational proce- 
dures related to that described here. However, they were not accompanied by 
any theoretical analysis, partly because any such analysis appeared challeng- 
ing due to the level of complexity of the proposed algorithms. Indeed, one of 
the aims of this paper is to provide a procedure which is simple enough to be 
theoretically tractable, but also performs well in practice. 

The paper is organised as follows. In Section 2, we describe our model and 
estimation problem. In Section 3, we introduce and analyse our wavelet-Fisz 
technique in the case when the variance function h is known. Section 4 describes 
the Nadaraya- Watson estimator of h and considers its theoretical properties. 
In Section 5, we show how the estimator of h from Section 4 is used in our 
data-driven wavelet-Fisz estimator of a, and establish the mean-square near- 
optimality of the latter. In Sections 6 and 7, we discuss various implementation 
aspects of our estimators of h and a, respectively. Section 8 demonstrates how 
our wavelet estimator leads to a new data-driven variance-stabilising transform 
performed in the wavelet-domain. Section 9 concludes, and the proofs of our 
results are in three appendices. R code implementing our method has been 
made available on http://www.maths.bris.ac.uk/~mapzf/ddwf/ddwf.html 

2. Model and preliminaries 

We consider the regression model 



Xt = a{t/n) -\- et, t=l,...,n, 



(2) 
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where Xi, . . . , Xn are assumed to be nonncgativc and independent, with E{Xt) = 
a{t/n) and Var(Xt) = Va,r{et) = h{a{t/n)}. Our task is to estimate a nonpara- 
metrically via nonhnear wavelet shrinkage, assuming that the function h is not 
necessarily known. 

We place the following assumption on the function a. 

Assumption 1. For the function a{z) : [0,1] i-^ M, we denote a = inf^ q;(2;) 
and a ~ sup^ a{z). We assume 

(i) a is of finite total variation over [0,1], 
(a) < a <a < oo. 

Assumption l(i) is a mild smoothness assumption on a. Since the class of 
bounded variation functions also includes functions of a low degree of regularity, 
the choice of nonlinear wavelet shrinkage as the preferred estimation method 
appears natural in this context. 

As mentioned earlier, our estimator of a can be viewed as using the principle 
of variance stabilisation (in the wavelet domain) . Many time-domain variance- 
stabilising transforms, such as the square-root transform for Poisson data or the 
log transform for scaled data, would require that the function a be bounded 
from below, as specified in Assumption l(ii). Therefore, it is not surprising that 
we also require this assumption to hold. 

Further, we impose the following assumption on h. 

Assumption 2. For the function h : [0, oo) i-^ [0, oo) we denote h = 
infuefa.Q] h{u) and h — sup„g[^_^j h{u). We assume 

(i) < h< h < oo, 
(a) h is non- decreasing, 

(Hi) h is Lipschitz-continuous of order 1 on u £ [ffi, a] with constant H , 
(iv) there exist 6,6, H > such that /i* is Holder- continuous on u ^ [0, oo) 
with Holder exponent 6 and constant H . 

The class of distributions with a variance function h satisfying Assumption 2 
includes, amongst others, the Poisson distribution, for which h(u) = u, and 
distributions of the form Xt = a{t/n)Zt where {Zt}t are i.i.d. with E{Zt) = 1, 
for which h(u) is proportional to w^. 

Finally, we make the following assumption about the central moments of St- 

Assumption 3. We assume that there exists a positive constant K and a non- 
negative constant 7 such that 

E|et|' = E|At - a{t/n)\^ < {11^+'' K^-^h{a{t/n)} 

for I ~ 3,4, . . . and all t. 

Assumption 3 is natural and common in the context of wavelet estimation 
in non-Gaussian noise, see for example Neumann (1996). It is satisfied by many 
standard distributions, including, amongst others, Poisson and gamma. Roughly 
speaking, it ensures that local sums of Xt are asymptotically normal in a certain 
asymptotic regime and in a certain required "strong" sense. 
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3. Wavelet-Fisz estimation for h known 

In this section, we aim to estimate a using nonlinear wavelet shrinkage assuming 
that the variance function h is known. Throughout the paper, we assume basic 
familarity with the Discrete Wavelet Transform (DWT). We refer the reader to 
Mallat (1989) for a description of the DWT, and to Vidakovic (1999) for an 
excellent overview of wavelet methods in statistics. 

We now describe, step by step, our algorithm for computing the wavelet-Fisz 
estimator of a if the function h is known. 

1. The starting point is the DWT of the observed data with respect 
to an orthonormal basis of compactly supported wavelets. Later, in As- 
sumption 4, we will specify additional technical conditions on the wavelets. 
The DWT converts the regression problem (1) into a regression problem 
in the wavelet domain 

Yj^k = f^j.k + £j.k, j = 0, ...,J-1, k = l,...,2^, 

where J = log2 n, with the only "smooth" coefficient indexed by {j, k) = 
(—1, 1). The variables Yj^k are the empirical wavelet coefficients of Xj, the 
constants ^j^k are the wavelet coefficients of a{t/n) which need to be esti- 
mated, and the "wavelet noise" variables £j^k are the wavelet coefficients 
of Ef. The sequence ^j^k will often be sparse, with most ^ij^s being equal 
or close to zero. 

2. We then separate the indices (j, fc) into two groups: those corresponding 
to the coarser scales < j < J* — 1, for which Sj^k will be asymptotically 
normal, and those corresponding to the finer scales J* < j < J — 1, which 
will be "ignored" in the estimation procedure. To be more precise, we 
define J* and a set X„ as follows: 

In = {ij,k)eZ^ : l<k<2^; < j < J* - 1; 2-'' =71^-'}, 

for a fixed e S (0, 1/3]. The choice of 1/3 as the upper bound for e is linked 
to the postulated smoothness of a. The reader is referred to Section 3.3 
of Fryzlewicz ct al. (2008) for a more detailed discussion of this issue. 

3. As the sequence iij^k is likely to be sparse, with most fij^s being equal 
or close to zero, we use a simple thresholding technique, which does not 
estimate Hj^k by zero if and only if the corresponding Yj.fc exceeds a certain 
threshold in absolute value. This ensures that a large proportion of the 
noise Sj^k gets removed. 

In wavelet function estimation with Gaussian errors, possibly the simplest 
("universal") threshold, advocated by Donolio and Johnstone (1994), takes 
the form 

A,^, = {2Var(e,,fc)log(#J„)}i/2, (3) 

where is the cardinality of the set A. Since in the set In, our wavelet 
coefficients Sj^k are asymptotically Gaussian, we wish to explore the possi- 
bility of applying an analogous threshold in our set-up. To effect this idea. 
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we need to determine Var(ej_fc). Denoting by {V'^.tIt the elements of the 
discrete wavelet vector at scale j, we find 

Var(e,, ,) - Var ^^,,k-te}j = ^ (4) 

Obviously, a(t/n) is unknown and needs to be pre-estimated. For simplic- 
ity and speed of computation, we use the same pre-estimate for each t in 
supp('!/'j\fc- ), namely the following localised mean of Xt: 

a{t/n) =^Kj^k-qXq, (5) 

where the constants k^^t satisfy Assumption 5 below. As the discrete 
wavelet vectors are normalised so that J^t'^'j k-t ~ obtain our "es- 

timated" thresholds as 

kk = h}'^ ( V K,,u-,XA {2 log(#Z„)}i/2. (6) 



(As a side remark, we note that in the absence of an assumed mean- 
variance relationship, Gao (1997b) estimates thresholds as proportional 
to the "running median absolute deviation" estimate of the standard de- 
viation of the heteroscedastic noise.) 

We use our estimated thresholds to estimate fij^k in the set Z„ via either 
the soft or the hard thresholding rule: 

= sign(y,,fc)max(|y,,,|-A,,fe,0) (7) 
= y,,a(|>S-,^|>A„,), (8) 

where I(-) is the indicator function. Outside the set Z„, we simply estimate 
/ij^fc by zero, that is 

A5S = /i5!2 = for (j,fc)e{j>o}nx,';. 

4. Let (e) denote either one of: (s) or [h). The inverse DWT of the sequence 
i^Yk yields our wavelet-Fisz estimator a'^'^\ 
A few remarks are in order. 

Stability. Let stdev(Ar) denote the standard deviation of a random variable X. 
Looking back at the derivation in formula (4), another "obvious" way of estimat- 
ing stdev(ej_fc) would be to set stdev(£j,fc) ~ {X^t V^f fe-t^{-'^t}}^^^- However, 

comparing it to our estimator stdev(ej,fe) = h^^"^ (^tK,j,k-tXt) from formula 
(6), it is easily seen that the latter typically involves lower powers of Xt, and 
thus is potentially a more "stable" statistic. As an example, consider h{u) = . 
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In this case, stdev(£j^fc) is a localised I2 norm of Xt, whereas stdev(£j^fc) is a 
localised li norm. A similar comment applies in the case h{u) for all 

(3>l. 

Link to maximum likelihood estimation. If a{t/n) is constant over the sup- 
port of Kj^k--, then, by the invariance principle of maximum likelihood esti- 
mators, our estimator stdev(ej_fe) is precisely the maximum likelihood estimator 
of stdev(£j.fc); provided that Kj.fc-t = const for t G supp(Kj_fc_.). 

The name "wavelet- Fisz". Note that the argument of the indicator function in 
(8) can be rewritten as 



> {2 log(#J„)}i/2 (9) 



In a non- wavelet context, ratio transformations similar to that on the left-hand 
side of (9) and the asymptotic normality of the resulting random variables were 
studied by Fisz (1955), which justifies the name of our procedure. The division 
in (9) provides a degree of "variance stabilisation": note that the threshold 
{2 log(#X„)}^/^ is suitable for standard homoscedastic normal noise. In this 
sense, our procedure can be viewed as being based on the principle of variance 
stabilisation in the wavelet domain. We expand on this issue later in Section 8. 

Link to Fryzlewicz et al. (2008). Wc note that in the special case h{u) = u^, our 
estimation algorithm is equivalent to the method proposed by Fryzlewicz et al. 
(2008) in the context of spectral density estimation. 

We now establish the mean-square convergence rate of our wavelet-Fisz esti- 
mator a^*^' . In order to do so, we specify assumptions on the wavelets ^/'jjt and 
the constants Hj^k- 

Assumption 4. The discrete wavelets used in the construction of a^'^' are de- 
rived from a continuous-time orthonormal wavelet basis o/L2[0, 1], {0o,fc(-z)}fc U 
{'^jAz)}j>o,k, where c^j^z) = 2^/20(2^2 - k) and V'j,fc(z) = 2^/2^(2^2 - k). 
The "mother" and "father" wavelet functions "0 and (j) are assumed to satisfy, 
for some r > m (with m given in Theorem 1 below), 

(i) (j) o.nd ijj are in the space C, 
(li) ]cj){z)dz = I, 

(Hi) J ip{z)z'-dz = for all < I < r . 

Assumption 4 defines the so-called r-regularity of the wavelet basis, and is 
commonly used in wavelet function estimation. 



Assumption 5. The constants Kj^r > are such that 



= o(2-^) 
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max Kj^r = 0(2^ 

/or alio <j<J*-l. 

Note that each of the vectors kj can be interpreted as a hnear filter which 
computes the local mean of Xt over the support of the vector ipj in (5). 

As has now become standard in the wavelet literature, we assume that the 
unknown function a is in a Besov ball of radius C > on [0, 1], = J-"™^(C), 
where m > and < p, q < oo. Roughly speaking, the not necessarily integer 
parameter m indicates the number of derivatives of a, where their existence in 
required in the Lp-sense, and thus p can be viewed as a measure of the inho- 
mogeneity of a. The additional parameter q provides a further finer gradation. 
Besov classes include the traditional Holder and Sobolev classes (p = q = oo 
and p = q = 2, respectively). If the regularity r of a wavelet basis is greater than 
m and if other conditions of Assumption 4 hold, then the membership of a in 
T"^ can be characterised in terms of the wavelet coefficients /i^ — fij^kn'^^^ of 
the function a in the following way. Define the Besov sequence ball of radius C 
as 

[ j>o 

where s = m+l/2 — l/p and \\fJ-'j\\lp = J2k=i l/^j fcl^- Assumption 4 holds, then 
a is in !F"^ if and only if {^'j k}j.k is in b^^{C). The reader is referred to Meyer 
(1992) for rigorous definitions and a detailed study of Besov spaces. 

Denote ||w||^^jp ^ = \v{u)\'^du. We are now ready to state a result on the 
mean-square rate of convergence of our wavelet-Fisz estimator a'-'^-' . 

Theorem 1. Let (e) denote either one of: (s) or (h). Suppose that Assump- 
tions 1, 2, 3, 4 cmd 5 hold. We have 

sup E||5(^) - = O {(log ,Vn)2™/(2"+i)} . 

The rate O is the best possible mean-square error rate for 

Besov spaces, and our wavelet-Fisz estimator achieves it up to a logarithmic 
term, attaining the same rate as the universal thresholding estimator in the 
case of i.i.d. Gaussian noise. We mention that linear estimators, such as kernel 
estimators, cannot attain the optimal mean-square error rate (not even up to a 
logarithmic factor) for p < 2. 

4. Estimation of the variance function h 

In this section, we assume that the function h is unknown, and we propose to 
estimate it by means of a Nadaraya- Watson estimator h. Later, in Section 5, h 
will be used in the data-driven wavelet-Fisz estimator of a. In order to establish 
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the mean-square convergence of the latter estimator, we need to be able to 
determine large deviation properties of h. Indeed, the main aim of this section 
is to demonstrate an exponential inequality for h (which will be sufficient for 
our purposes). 

The estimator h is constructed as follows. We start with a preliminary esti- 
mator of a{t/n) de&ied by 



t+M 

i 

at 



-, t+M 

— E (10) 



2M , . 

p=t~M 

where the choice of M will be discussed in the paragraph underneath Theorem 2, 
as well as in Section 6. We define empirical residuals from this fit by it = 
Xt — at- Our Nadaraya- Watson estimator h performs kernel smoothing of the 
squared empirical residuals e^. More specifically, we use a kernel function K 
which satisfies the following assumption. 

Assumption 6. The function K : [—1/2,1/2] i-^ M. is nonnegative, bounded, 
integrates to one and is Lip schitz- continuous of order 1 with constant L. We 
denote K = max^ K{z). 

We define 

^ ;^-(^) <") 

■,xr I \ I f at-u 

no 

where the choice of b will also be discussed in the paragraph underneath The- 
orem 2, as well as in Section 6. The Nadaraya- Watson estimator h is defined 

by 

Yrt=iWnt{u)e] 



h{u) 



E"=l^nt(w) 



We now list and clarify a number of assumptions which will be used in proving 
the main result of this section. 

Assumption 7. We have Var(ef ) > c > 0, uniformly over t. 

Assumption 8. Denote Zt ~ \at — E(dt)|. We assume that there exists a 
positive constant C2 such that 

Var(dt) = Var(at - E(at)) < C2 Var(Zt), 

uniformly over t. 

Assumption 9. We assume that there exist positive constants a, d such that 



P\ sup |et| < a log^n ) = 1 - 0(71"^). 

Vt=l,...,n 
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Assumption 10. There exists a function c{u) such that 

n 

0<Ci <C{U) <Y,Wntiu) 
t=l 

uniformly over n and b, for all u G rangc{Q!(z)}. 

Assumption 11. Let the constants Kj^r satisfy Assumption 5. We assume that 
the function a{z) is such that for all (j, k) S I„, 

'^Kj^k-qa(q/n) e rangc{a(z)}. 

Assumption 7 ensures that ej is a non-degenerate random variable. 

For any random variable Y, it is easy to see that Var|y| < Var(y). Assump- 
tion 8 guarantees that the converse is true, up to a constant. This is not a 
restrictive assumption: we expect that at — E(dt) will be close to A^(0, cr^), and 
for Y ~ N{0,a'^) we have Var(y) = ^Var|r|. 

Assumption 9 is satisfied for all distributions whose tail decays exponentially. 

We now demonstrate that Assumption 10 is satisfied for functions a{z) which 
are piecewise Lipschitz-continuous of order 1 with a finite number of breakpoints. 
For clarity, we only show it for the triangular kernel K{z) = (~4|z| + 2)I(|2| < 
1/2), but the proof remains almost unchanged for other kernels. 

Proposition 1. Let K{v) ~ (— 4|w| + 2)I(|?j| < 1/2) and let a{z) be piecewise 
Lipschitz-continuous of order 1 with a finite number of jumps. There exists a 
positive constant ci such that X]"=i ^nt{u) > ci uniformly over n, b, and u € 
range{a(z)}. 

We note that Assumption 10 can be relaxed to include functions which are 
piecewise Holder continuous, at the expense of worse rates of tail decay in 
Theorem 2. As an interesting example, wc note that Donoho and Johnstone's 
(1994) benchmark signals blocks, bumps, doppler and heavisine are all piecewise 
Lipschitz-continuous of order 1. 

Assumption 11 ensures that "local averages" of the function a lie within the 
range of a. 

We now state an exponential inequality for the estimator h{u), which is the 
main result of this section. 

Theorem 2. Suppose that Assumptions 1-3, 6-11 hold, and that the constants 
Kj^T satisfy Assumption 5. Letbn, d„ and in be any fixed sequences such thatbn = 
o{(n/Af)i/(6+47)}^ d,^ = o|minj 2 ^(i+°°"4,i)) } and i„ = o {(n62)i/(io+i27)}^ 

where M, b and 7 are defined in formulae (10) and (11) and Assumption 3, 
respectively; 2"^ — n, and the range of j is < j < J* — 1. Let Si be any 
positive quantity such that di < c\ where c\ is defined in Assumption 10, and 
define 5' = 5[ci — 5i)/2, where 5 appears in the exponential inequality below. 
Assume 
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(5 > bH, (14) 

as M,n/M,nb — » 00 and 6 — > 0, where d is defined in Assumption 9 and 
H is the Lipschitz constant for h(u) over u £ range{Q!(2)}. In the asymp- 
totic limit, as AI,n/M,nb — > 00 and & — > 0, we have, uniformly over {j,k) G 

T 



P 



> S 



< C3(2M + 1) {4 - $(niin(a„, b„)) - 2$(min(e„, 6„)) - $(min(.g„, 6„))} 

+C4 {3 - $(niin(c„, dn)) - 2$(niin(/„, d„))} + C5 {1 - $(min(/i„, i„))} 
+0(n-2), 

where C3, C4 and C5 are positive constants, $ is i/ie cd/ 0/ the standard normal, 
and 



a„ 
e?? 



j 



/„ = 0(<5'62inin2('^-^)/2log-2'^n) 



hn = 0{(,5 - fei7)n^/^6}. 

Explanation of the rates. We take M = 0{n^) for e (0,1) and 6 = 0{n^'^) 
for C £ (0,1) and investigate conditions on ^ and ( which ensure that the 
assumptions of Theorem 2 are satisfied. Clearly, if C < 1/2, then &„, d„ and 
in can be chosen such that they are all of order 0{n'-), for > 0. Fixing to 
be constant, and assuming that 6 and 6' cither are constant or tend to zero no 
faster than logarithmically in n, we have that conditions (12) - (14) are satisfied 
if both 1 > 2C + and i?/4 > Finally, to ensure that the sequences a,i, c„, 
e„, /„, Qn and /i„ are all of order 0{n'') for > 0, we additionally require that 
e/4 > C, where 2^ = n}~^. Thus, in the {e,d,C) space, the set A of parameter 
configurations which are "admissible" in the above sense has the form 

A = {(e, ,9, C) :ee (0,1/3], Ce (0,min(^/4, 1/2-^9/2, e/4))}. (15) 

In view of the above discussion, the following corollary can be formulated. 

Corollary 1. Suppose that Assumptions 1, 2, 3, 5, 6, 7, 8, 9, 10 and 11 hold. 
Let A be as defined in (15), and let M = Oin^) andb = 0{n-'^). If{e,d,C) G A 
and 5 > ©(log"^' n) for some v < 0, then there exists > such that, uniformly 
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over {j,k) €ln, 



P 



Kj,k-qa{q/n) 



> 6 



<0{n'^ cxp 



^2? 



5. Data-driven wavelet-Fisz estimation (for h unknown) 

Our algorithm for computing the wavelet-Fisz estimator of a if the function h 
is unknown (which we also call the data-driven wavelet-Fisz estimator of a), 
proceeds in the same way as the wavelet-Fisz algorithm for h known, described 
in detail in Section 3, the only exception being that we use our estimate h from 
Section 4, instead of the true h. 

To be more precise, we replace our thresholds Aj^fe, defined in formula (6) 

and subsequently used in the threshold estimators jlj^l and (see formulae 
(7) and (8)), with thresholds 

A„fc - h'/-' ^j,k-<iXg^ {2 log(#l-„)}'/'- (16) 

We denote the thus constructed data-driven wavelet-Fisz estimator of a by 
a^'^^ where (e) is one of: (s) (soft thresholding) and {h) (hard thresholding). 
The following theorem quanitifies the mean-square rate of convergence of a'^'^K 

Theorem 3. Let (e) denote either one of: (s) or (h). Suppose that Assump- 
tions 1-11 hold. Let A he as defined in ( 15), and let M = 0(n^) and b = 0{n~''). 
If (e, z?, C) e A, then 

sup E||a(^) - a||i^[o.i] = O {(log n/n)2™/(2™+i) | 

Comparing this result with Theorem 1, we note that the estimator a^'^'' does 
not exhibit any loss of asymptotic efficiency compared to a*-*^' despite the fact 
that it uses an estimate of h rather than the true h. 



6. Estimation of h, implementation issues 

This section briefly describes the outcome of an extensive simulation study 
aimed at assessing the performance of the estimator h(u) for various parameter 
configurations. Recall that h{u) is assumed to be non-decreasing (see Assump- 
tion 2). As h{u) is not guaranteed to be non-decreasing, in practice we used 
the following computational "correction" to h{u). Having obtained h(u), we in- 
put it into the (automatic) "pool- adjacent- violators" algorithm for least-squares 
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isotonc regression, described in detail in Johnstone and Silverman (2005b), Sec- 
tion 6.3. The resulting estimate, denoted hereafter by h{u), is a non-decreasing, 
piecewisc constant function of u, which is as close as possible to h{u) in the 
least-squares sense. Empirically, we found that the use of h(u), rather than 
h{u), in the estimator a^'^\ results in a slightly superior performance of the 
latter. (We note that literature on monotone estimation of variance function 
is sparse; Dette and Pilz (2007) consider estimation of a monotone conditional 
variance in nonparametric regression.) 

Having investigated various choices of the span M and the bandwidth b for 
a range of test functions and noise distributions, we found that h{u) performed 
particularly well for "small" values of AI. Any value of M < 3 consistently 
resulted in good estimates. The examples later in this section use the value 
M = 3. The estimator seems to be less sensitive to the choice of b: this is due 
to the computational correction (described above), which "smooths out" any 
remaining wiggles in h(u). We recommend an "automatic" choice of 6, such 
as that described in Gasser et al. (1991) and conveniently implemented in the 
routine glkerns from the R package lokern. We also use the default kernel 
function K{-) from the above routine. 

We briefly illustrate the performance of h{u) on 4 simulated datasets. The 
models are: the Poisson model, whereby Xt ^ Pois{a{t / n)} , and the exponential 
model, in which Xt ^ a(t/n) Exp(l). With each model, we use two functions 
a{z): the blocks function, scaled and shifted to have the minimum (maximum) 
value of 1 (22.6), and the bumps function, with the minimum (maximum) value 
of 3 (23.21). Both functions are sampled at 2048 equispaced points. 

Figure 1 shows sample paths from each model, together with the correspond- 
ing estimates h}/'^{u). The estimator performs well in all cases. The good per- 
formance is not incidental: indeed, we found that for the parameter choices 
described above, the estimator h{u) performed well across all simulated exam- 
ples. 

7. Estimation of a implementation issues 

This section discusses the choice of the various parameters for our data-driven 
wavclct-Fisz estimator cS'^^ . The examples in this section use Haar wavelets: this 
choice is motivated in Section 8 below. As a default option, we use translation- 
invariant (see Nason and Silverman (1995)) hard thresholding with J* = J — 2, 
as this parameter configuration seems to offer the best empirical performance. 
We use the variance estimator h{u) described in Section 6 with M = 1. For each 
j,k, we choose the parameters Hj,k-t to be constant for t G supp('!/'j,A:_.). This 
is a natural choice for Haar wavelets as the coefficients Kj,k-qXq are already 
available to us as "by-products" of the discrete Haar transform. 

Figure 2 shows the outcome of our estimation procedure described above for 
the sample paths from Figure 1 . It is clear that our procedure performs very well 
for Poisson noise. Performance for exponential noise is also satisfactory given 
how noisy the original signals are: indeed, it is extremely hard to identify some 
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Fig 1 . Left column, from top to bottom: sample paths from the Poisson model with the blocks 
and bumps functions, sample paths from the exponential model with the blocks and bumps 
functions. Right column: corresponding estimates h^^^{u) (step functions) and the true stan- 
dard deviation functions h^^'^{u) (continuous functions). 

of the features of the clean bumps and blocks signals from the visual inspection of 
the corresponding exponential datasets. We mention again that our estimation 
procedure "does not know" any characteristics of the noise, and estimates the 
variance function h{u) from the data. 

We end this section with a brief comparison study of our estimator versus 
Gao's (1997) estimator for general heteroscedastic noise. The better performance 
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Table 1 

Mean-square errors over 100 simulations for the 4 models, for Gao's method (Gao) and the 

Data-driven wavelet- Fisz (DdwF) 





blocks exp 


blocks Pols 


bumps cxp 


bumps Pols 


Gao 


7.10 


0.93 


3.01 


0.91 


DdwF 


4.02 


0.52 


2.51 


0.54 





500 1000 1500 2000 500 1000 1500 2000 



Fig 2. Top row: estimates a^^^ for the Poisson model (l^locks, left; bumps, right). Bottom 
row: analogous results for the exponential model. Solid lines are the estimates, dashed lines 
are the true signals. 

of our method is unsurprising as Gao's method does not take into account the 
mean-variance relationship and thus uses less information. 



8. Data-driven wavelet-Fisz transform 

In this section, we describe a wavelet-domain variance-stabilising transform im- 
plied by our data-driven wavelet-Fisz estimation procedure. 

Note that the computation of the data-driven wavelet-Fisz estimate a'-'^^ can 
be performed in the following three steps. 

1. Take a DWT of the data. For each j = 0, . . . , J - 1 and fc = 1, . . . , 2J, 
divide the coefficient Yj.fc by h^^^i^g Kj,k-qXq)- Take the inverse DWT 
of the modified coefficients. Call the resulting vector Xt. 
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2. Smooth Xt by means of a standard nonlinear wavelet thresholding proce- 
dure suitable for i.i.d. Gaussian noise, using the same wavelet family as in 
Step 1. To be more precise, we apply the threshold {2 log(#Z„)}^/^ for 
(j, k) S 2,1, and set the empirical coefficients to zero for j G [J*, J — 1]. 
Either soft or hard thresholding can be used. 

3. Take the inverse transform to that described in Step 1. 

Wc call the transform in Step 1 the data-driven wavelet-Fisz variance-stabilising 
transform. Empirically, it stabilises the variance of Xt and brings its distribution 
closer to Gaussianity. The mechanism of the transform was already explained 
in the discussion underneath formula (9). 

In our simulations, we found that the distribution of the "noise" in the trans- 
formed vector Xt was the most symmetric when Haar wavelets were used. This 
was due to the fact that Haar wavelets arc symmetric in the sense that the pos- 
itive part of each Haar wavelet vector is an exact shifted version of its negative 
part. 

Figure 3 compares the data-driven wavelet-Fisz transform (with parameters 
as in Section 7) of the exponential bumps dataset, with its logarithmic transform. 
Note that the latter acts as an exact variance-stabiliser, due to the multiplicative 
structure of the model. However, it is clear from the plot that not only does 
the data-driven wavelet-Fisz transform also stabilise the variance of the noise 
very well, but in addition it brings the distribution of the noise much closer to 
Gaussianity than the log transform (and does this without knowing the original 
noise distribution or the structure of the model). It also seems to bring out the 
shape of the underlying signal more clearly. 

From a computational point of view, in view of the Gaussianising and variance- 
stabilising action of the data-driven wavelet-Fisz transform, the analyst wishing 
to find out more about the shape of the underlying signal may choose to apply 
any smoother suitable for i.i.d. Gaussian noise to the wavelet-Fisz-transformed 
dataset. 
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9. Discussion 

We conclude with three final remarks. 

Applications. Readers interested in the applications of the data-driven wavelet- 
Fisz methodology are referred to our earlier work Motakis vt al. (2()0()) and 
Fryzlewicz et al. (2007), where we proposed computational procedures related 
to that described here and applied them to gene expression data, and solar 
irradiance data, respectively. Those heuristic algorithms were not accompanied 
by any theoretical analysis, leaving a gap which is filled by the present work. 

Density estimation. In practice, our data-driven wavelet-Fisz estimator can also 
be applied to the problem of density estimation from binned data. Although 
this problem does not exactly fall into the class of models described by formula 
(2), it can be approximated, in a certain asymptotic regime, by the Poisson 
model, which is a sub-case of (2). We mention that Brown et al. (2007) propose 
a wavelet-based method for density estimation from binned data which includes 
a time-domain variance-stabilising transform as an initial step. 

Non-equispaced design. Jansen et al. (2009) mention the possible use of the 
Haar-Fisz transform for Poisson data on graphs and in other "irregular multi- 
dimensional situations" . We believe that similarly, the data-driven wavelet-Fisz 
methodology could also be used in such set-ups. 

Appendix A: Proof of Theorem 1 

We first state an auxiliary result. Denote (t|j, = Va.i-{e j,k) = X^t "^f fe-t^{<^(V"-)} 
(see also formula (4)). We specify the following assumption on an arbitrary set 
of deterministic (non-random) thresholds X^^'^l . 

Assumption 12. 

max aJ.J = 0(log^/^n), (18) 

where (f> is the standard normal density. 

For the purposes of this section, we extend the notation a^'"' to mean any 
estimator constructed as in Section 3, using an arbitrary set of thresholds Xj^k- 
To emphasise the dependence of a^'^'> on Xj.k, we will wr ite a^^HXj.k). 

Theorem 4. Let xfl he any non-random thresholds satisfying Assumption 12. 
Further, suppose that Assumptions 1, 3 and 4 hold. We have 

sup E\\a^^\xf,) - = O {(log r./n)^"/(2"+i)| . 
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The proof of Theorem 4 proceeds exactly like the proof of Theorem 5.2(ii) in 
Neumann (1996). We omit the details. 

We now define what we call "lower" and "upper" deterministic thresholds 
AjS'^and A^"^: 

Aj'fc^ = lnh'^'{Y.^J.'^-MlM\{2logmn)V^' (19) 




(20) 

where 7„ is a sequence approaching one from below. C is a generic positive 
constant (see Lemma 1 for the permitted range of C) and Kj,r are as in formula 
(6). Proving that the lower and upper thresholds satisfy Assumption 12 (and 

thus that Theorem 4 holds for ci^'^H-^j'fc''') ^^'^ o;^'^-' (A^'*^"'')) is a step on the way 
to proving Theorem 1. 

Lemma 1. Suppose Assumptions 1, 2 and 5 hold. If C > (2/i)^/^, then for all 
j,k, Aj''^"^ > f^X^^f^, and both A^'*^/^ and X^^lf^ satisfy Assumption 12. 

Proof. It is easy to check that if C > {^hf/'^, then for all j,k, > 

supjjjAj'*/^ We first check (17) for A^'^^;''. The factor x'^^j^^ /uj^k + 1 only con- 
tributes a logarithmic term so we skip it. Denote a^f. = inf{Q;(i/ri,) : t G 
supp(V'j,fc)} and Uj^k — sup{Q!(t/n) : t G supp(-0j,fe)}. Further, let TV(/)|a 
denote the total variation of the function / measured on the set A. Using As- 
sumption 2, we bound X^^f lo'j,k from below as follows: 

Ag^ ^ 7nfe^/^(«,,,){2 log(#T„)}V^ ^ 7»fe^/^(g,-...){2 \ogmu)Y'^ 
^ 7n/^^/^(a,,fc){2 log(#X„)}V^ ^ 7„/.i/2{2 log(#X„)}V2 



h^'^{(l,,k) + ^^TV(«)|supp(^,,o k^'^ + i/TV(a)|supp(^,^. ' 

where the last inequality follows from the fact that v{x) = x/{x -\- a^) is in- 
creasing on [0, oo). As in Ncnunaun (199()), the proof of Lemma 6.1(ii), we have 
X^fe TV(a)|supp(i^j_fc) < 0(l)TV(a)|[o,i] and thus for a sequence u;„ 0, at 
each scale j we have 

: TV(a)|supp(v.,,.) > ^n} = 0«^). (21) 

Denote P„ = Z„ n {{j,k) : TV(a)|supp('!/'j k) > ^n} note that by (21), at each 
scale j at most 0{u!^^) coefficients are in T>n. Denote further = Z„ \ I?„. We 
have 



5„ £n I 
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n 



0\ (#X„)'"'"(-""^") Uo(^-ilogn) + o{7Vi/(2™+i)}, 



for any m > 0. The last equality follows from the fact that 1— 7„ h+Huj J ^ 
Choosing ujn = log~^ n (say), we have that (17) is satisfied irrespective of the 
smoothness parameter m. Because the thresholds A^''^"'' are higher than 
(17) also holds for A^''^"'. Obviously, (18) holds for xfl"\ which implies that it 
also holds for Aj'j.'^ , since Aj**/^ are lower than A^"*^"^ . □ 

We now state another auxiliary result. We first specify an assumption on an 
jitrary set of rar 

Assumption 13. 



' (r) 

arbitrary set of random thresholds A^- f, . 



E P(^l<^n = 0{nn (22) 



for some 7„ —>■ 1_ (see the definition of ^^^jP in formula (19)), some v 



< 



l/(2m + 1) (with m given in Theorem 1), and some C > (2/i)^/^ (see the 
definition of \^^^^ in formula (20)). 

Theorem 5. Let A^'^ '^'^V I'o.ndom thresholds satisfying Assumption 13. Fur- 
ther, suppose that Assumptions 1, 2, 3, 4 cind 5 hold. We have 

sup E||fi('=)(A;.2) - a||i^[o,i] = O {(log n/n)2W(2™+i)\ 

The proof of Theorem 5 proceeds exactly like the proof of Theorem 6.1 in 
Ncumaim (1996). We omit the details. 

In view of Theorem 5, in order to prove Theorem 1, it suffices to show that 
our random thresholds Xj^k, defined in formula (6), satisfy Assumption 13. 

Lemma 2. Suppose Assumptions 2, 3 and 5 hold. There exists a 7„ — > 1_ and 
a C > (2/i)^/^ such that our random thresholds Xj^k, defined in formula (6), 
satisfy Assumption 13 for all v < 1/ (2m + 1). 
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Proof. Wc start with (22). To shorten notation, denote u — J2q '^j.k-qXq and 
u = Kj_k~qOi{q/n). Denote further 

Note that Vn ^ ^. We have 

= P{h\u) - h\u) > (1 - 7,f < P{\u -u\> iy„). 

Suppose Vn tends to zero logarithmicaUy fast in n (which is easy to ensure by 
placing an appropriate assumption on the speed of convergence of jn to one). 
Then, by Lemma 8, there exists e > such that 

P{\u-u\ > Vn) < C'4exp f - — 

Summing up over j, k we obtain 

^ P(A,,fe < Ag')) < C'4nexp = o{nn, 

for any which shows (22). The technique for showing (23) is exactly the same. 
We omit the details. □ 

The proof of Theorem 1 is complete. 
Appendix B: Proofs of results of Section 4 

Proof of Proposition 1. Let z be any point such that u = a{z) and let A 
be the Lipschitz constant for a. Denote 

= {t e 1, . . . , n : \a{t/n) - a{z)\ < A\t/n - z\}. 

Because a{z) is piecewise Lipschitz-continuous of order 1, it is clear that the 
cardinality of is uniformly bounded from below by cn where c G (0, 1) and 
that contains those t for which t/n is arbitrarily close to z from either the 
left- or the right-hand side (or both). We have 

|, i^atti^-j ^ |, ( ""''"'; °'" ) n(l"('/..) - < V4) 

^ E -^^(H^H - < V4) > E ^^(H^H - < V4) 

t=i tesj 

> y 4-1(^1*/"- ~ z\< > ^nbci = ci, 
^-^ no no 

which completes the proof. □ 
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Lemma 3. // Assumption 3 holds, then there exists a constant K > such 
that, for ^ = 3, 4, . . 



K 



1-2 



(2M + 1)1/2 



t+A/ 



p=t-M 



K 



(2M + 1)1/2 



Proof. Let C be a generic positive constant. Recall that by Stirling's approx- 
imation, 

< cxp{l - 1/(12/ + l)}/!(27r/)-i/2 (24) 

Using the Rosenthal inequality (Rosenthal (1970); Johnson ct al. (1985)), and 
then (24) and Assumption 3, we have 



E|at - E(at)|' = (2Af + 1)~'E 



t+M 



t+M 

^ - a{p/f 

p=t-M 

t+M 



< (2Af + 1) 



111 



, cn 

log' / 



1/2 ' 



max< ^ E\Xp - a{p/n)\\ \ ^ h{a{p/n)} 



p=t-M 

-ll 



< {2M + i)-n\p{i) 



t+M 



t+M 



1/2 ■ 



max<^ (n)i+^iC'-2 ^ h{a{p/n)}A ^ h{a{p/n)} 



(25) 



p=t-M 



ip=t-M 



where 



p{l) = C' cxp{/ - 1/(12/ + l)}(27r/)-i/2 log"' /. 
Noting that p{l) < const', we observe that (25) can comfortably be bounded by 

s 1-2 



t+M 



(2Af+l)-2 J2 h{a{p/n)}{llf+'' 

K 



K 



^t-M 



(2A/+ 1)1/2 

1-2 



Var(at)(^!) 



n2+7 



(2Af + 1)1/2 



for a constant K > Q, which completes the proof. 

Lemma 4. For a nonnegative random variable X and I > 1, we have 

E|X -E(X)|' < 2E{X'). 



□ 
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Proof. Noting that for c > 0, wc have \x — c|' < |a;|' + c', wc obtain 

E\X - E(X)|' < E(X') + {E(X)}' < 2E{X'), 
where the last step uses Jensen's inequahty. □ 
Lemma 5. Denote Zt = \at — E{at)\. Under Assumptions 3 and 8, we have 



n2+7 



E\Zt-E{Zt)[ < Var(Zt)(Z!) 



for / = 3, 4, . . where K is a positive constant. 



(2M + 1)1/2 



Proof. Using Lemma 4 with X = Zt and then Lemma 3 and Assumption 8, 
we have 



E|Zt - E(Zt)|' < 2E\at - E(dt)|' < Var(dt)(;!)2+T 



N 1-2 

2K 



(2M+ 1)1/2 



< Var(Zt)(/!)'+^ 



(2Af + 1)1/2. 

for K 2C2A'. □ 

Lemma 6. Denote it = ~ h{a{t/n)}. Under Assumptions 2, 3 and 7, we 
have 

E|£t|' < (;!)3+3^X{-2var(£0, 

for I — 3,4, . . ., where 

Var(ef) = Var(e2) < 2A^+''K^h{ait/n)} - /i2{a(t/n)}. 

Proof. Applying Lemma 4 to £j and then using Assumption 3, we get 

E\et\^ < 2E|£t|2' < 2{{2iyy+''K^^-^h{ait/n)}. 

Using the fact that (2/)! < 4(^!)^ and Assumption 7, we bound the above by 
{l\)^+^'^K[~'^YaT{et), which completes the proof. □ 

Lemma 7. Suppose that Assumptions 1, 2, 3 and 8 hold, and that S — 6n is 
such that 

Sn-'/^ - Mn-^'^ - ji1/2m-i/2 -> 00. 

Let bn be any fixed sequence such that bn = o{(n/A/)i/'-^+^'''-'}. In the asymptotic 
limit, as n, M, n/M — > 00, we have 

P \at - a{t/n)\ > 5^ < C3(2M+ 1){1 - $(min(a„, 6„))}, (26) 

where C3 is a positive constant, $ is the cdf of the standard normal, and the 
sequence a„ satisfies 

a„ = 0((5n-i/2 - Mn-1/2 - ni/2M-i/2). 
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Proof. We first note that if Assumption 1 (i) liolds, tlien 



2mTi E E HpM-c^itM\<c^ (27) 

t=i,i+l+2M,i+2+4M,... p=t-M 

uniformly over i and M, wfiere Ci is a positive constant. Summing up botli 
sides of equation (27) over i, we have J2t=i |Q;(i/") — IE(at)| < Ci(2A/ + 1). We 
bound 

p(^Jat-a{t/n)\>6^ < P (^Jat -E{at)\ + \Eiat) - a{t/n)\ > 6^ 

< p|^X^|at-E(dt)| >(5-Ci(2A/+l)^ (28) 

Denote 6' 6 - Ci{2M + 1). Tlic sequence {at}t is (2A/ + l)-dependent. To 
avoid comphcations which arise in deriving exponential inequalities for depen- 
dent sequences, we split it into independent sequences as follows. Rewriting the 
LHS of (28) as 

p[Y. E \at~n^t)\>5'Y (29) 

and using the fact that ai + . . . + am > 6 =^ 3i ai > S/m, as well as the 
Bonfcrroni inequality, we bound (29) by 

(2A/+l)maxP ^ |at - E(at)| > (57(2A/ + 1) 

\t=i,i+l+2ALi+2+4:M,... 

We drop the range of t to shorten notation, and assume without loss of generality 
that there are exactly n/{2M + 1) terms in the above sum. Denoting Zt = 
\at — E(d()|, we bound the above by 



{2M + 1) maxP Zt - E{Zt) > 5'/{2M + 1) - XI ^i^*)^ 

Wc first assess J2t^i^t)- 

XE(Zt) < X{var(at)}i/2 < nt^\2M + l)-^/^. 
t t 

Denote 

^„ _ S'/{2M + 1) - nt'^{2M + l)-^/^ 
{EtVar(ZO}^/' 



(30) 
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Zt - E{Zt) 



{EtVar(ZO} 



1/2- 



Note that by Assumption 8, we have S" = 0{Sn-^/'^ - Mn^^/'^ - n^/'^M-^/'^) 
and, by assumptions of the lemma, (5" — > oo. We now apply Theorem 1 and its 
Corollary from Rudzkis ct al. (1978) to the standardised sum ^. By Lemma 5, 
in our case the quantity A„ from the above Theorem takes the form 



. _ {EtVar(Z,)} 

ZA^i — - 



1/2 



2max{A:(2Af + l)-i/2^ niaxt{Var(Zt)}i/2} ' 

which, by Assumption 8, is of order 0{{n/MY/^}. Recall that 6„ = 
o{(n/M)^/(^+^'''^}. Using the above Theorem, we bound (30) by 

(2Af + 1) maxP(^ > 5") < {2M + 1) maxP{^ > mm{S" , 6„)} 

i i 

< C3(2M + l){l-$(min(y',6„))}, 

which completes the proof. □ 

Lemma 8. Let the constants Kj_r satisfy Assumption 5. Suppose Assumptions 2 
and 3 hold. Recall that n = 2"^. Let 6„ be any fixed sequence such that 6„ = 

a ^22(i+">'"'<T-i" ^ uniformly over < j < J* — I. Ln the asymptotic limit, as 

n — > oo, and uniformly over < j < J* — 1, we have 



P 



^ Kj^riXr - a{T/n)} 



>S] < C4(l - $(min(a„,6„))), 



where C4 is a positive constant independent of j , <!> is the cdf of the standard 
normal, and the sequence a„ satisfies 

a„ = 0(5min2(^-J)/2). 
j 

Proof. Denote Xr = i^j,T{Xr — a{T/n)}. We have Var(X^) = k'^ ^h{a{T / n)} 
and, by Assumption 3, 

E\Xr\^ < {ll)^+'>iKKj,ry-'^Ya.T{Xr) < (11)^^'' {K ma.x KjA'~^YaT{Xr) (31) 

A; 

for l = 3,i,.... Denote ^ = Er ^r/{Er Var(X,)}i/2 and S' ^ 5/{E, Var(l,)}i/2. 
By Assumption 5, 6' = 0(52(-'-J)/2). We now apply Theorem 1 and its Corol- 
lary from Rudzkis ct al. (1978) to the standardised sum ^. By (31), in our case 
the quantity A„ from the above Theorem takes the form 

. {E.Var(X.)}V^ 
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which, by Assumption 5, is of order 0(2("'~^'/2) Recall that fe„ = o I 22(i+.»ax(T,i)) 



Using the above Theorem, we bound 

> S') < Pm > min(<5',M) < C4(l -$(min(5',6„))), 
which completes the proof. □ 

We define ujntiu) = Wnt{u)/ jyt=i^rit{u). Obviously I]"=i'^nt(") = 1; 
and if Assumptions C and 10 hold, then w„t(u) < K/(cinh) = 0(n^^b^^) 
so that Y^^=i^nti'^) — 0{n~^b~^). By Cauchy inequality, we also have 1 = 
{J2^=i ^ntiu)}"^ < JT-X^tLi "^ntC")' which implies X]r=i '^nt(^) — Summaris- 
ing the above bounds, 

Oin-'b-')>j2^ltiu)>l/n. (32) 

Lemma 9. Suppose Assumptions 2, 3, 6, 7 and 10 hold. Let bn be any fixed 
sequence such that bn = o ((71,6^)^/^^'^+^^''')). In the asymptotic limit, as nb — > 00, 
6 — > 0, we have 



P 



^w„t(u)(e^ - h{a{t/n)} 



t=i 



>S] <C5(l-$(min(a„,6„))), 



where C5 is a positive constant, $ is the cdf of the standard normal, and the 
sequence a„ satisfies 

an=0(5n^''^b). 

Proof. Denote et = cj„t{u){ef — h{a{t/n)}). By Lemma 6 and Assumption 10, 

E\et\^ < ujl,{u){llf+^^ K[-''YaTie^t ~ h{ait/n)}) < ')'"'Var(et), 

for / = 3,4,..., where K' = Kik/{cinb) = 0{n~^b~^). Denote ( Et ^t/ 
{Et Var(£t)}i/2 and S' = ^/{Et Var(£t)}i/2. By (32), 5' > We now 

apply Theorem 1 and its Corollary from Rudzkis ct al. (1978) to the standard- 
ised sum ^. In our case the quantity A„ from the above Theorem takes the 
form 

" 2max{A",maxt{Var(et)}^/2}' 

which, by Assumption 10 and (32) is of order at least 0{n^^^b). Recall that 
bn = o ((nfe^)^/'^"'*"^^''')) . Using the above Theorem, we bound 

Pi\^\ > S') < P(|e| > min(5',6„)) < C5(l - $(min((5', 6„))), 

which completes the proof. □ 

Lemma 10. Suppose that Assumptions 1, 2, 3, 6, 8 hold, and that the con- 
stants Kj^r satisfy Assumption 5. Let bn, dn be any fixed sequences s.t. bn = 
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o{(n/A/)i/(6+''^)} and d„ = o [mmj 22<i+— (^.D) j , where < j < J* - 1. Lei S 
be such that 

as M,n,n/M — *■ oo and b 0. In the asymptotic limit, as n, M,n/M — > oo 
and b 0, uniformly over j, k, we have 



< C^{2M + 1) {1 - $(min(a„, 6„))} + C4 {1 - $(niin(c„, d„))} , (33) 
where C3, C4 are as in Lemmas 1 and 8 and $ is the cdf of standard normal. 

c„ = 0((5&2min2(^-J)/2). 

j 

Proof. Using the Lipschitz-continuity of the kernel funetion K{-), we have 

Wnt ^^^Kj,k~qa{q/n)^ - Wnt Kj^k-qX^ 

- I - a* I - i^3-k-q{Xq - a{q/n)} 

Thus, we bound the probabihty on the LHS of (33) by 



P(^|a(t/n)-d.|>^)+p 



2L 



^Kj^k-qiXg - aiq/n)} 



> 



2L 



□ 



Lemmas 7 and 8 yield the result 

Proof of Theorem 2. To shorten notation, denote u = J2q '^j,k-qOi{q/n) and 
u = Kj_k-qXq. By Assumption 11, u G range{Q;(z)}. We bound 

P{\h{u)-h{u)\ >S) < P{\h{u)^h{u)\ > S/2) + P{\h{u)-h{u)\ > 5/2) =: I+II, 
where 



h{u) 



We first consider /. Again to shorten notation, denote 



n n 

C = ^H^„t(u)e?^W^„*(u) 



t=i t=i 

n n 
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and also 



We bound 



E 



t=i 



I = P{\A- B\>SD/2) 

= P{\A - B\> 5D/2\E)P{E) + P{\A -B\> SD/2\E'')P{E'') 

< P{E) +p[\A-B\>^-'j2 Wnt{u) 1^ Wnt{u) - 5i| \E^'^ P{E') 

< PiE)+p(^\A-B\ > ^^M/„,(u)|^M^„,(^)-5i|j 

< PiE)+p(^\A-B\>S'J2Wntiu)^ ■ 
By Lemma 10, 

P{E) < C3{2M + 1) {1 - $(min(a„, 6„))} + C4 {1 - $(min(c„, d„))} . (34) 
We bound 

p(^\A-B\>S'J2Wntiu)^ 

<p(\A-C\>^-J2Wntiu)] + p(|C-i3|>^^iy„t(w)| =:h+h. 



Turning first to Ii, we have 



P 



< P 



P 



n 

t=l 
n 



> 



S' 



> 



5' 



t=i 



S' 



/11 +/i 



We first consider In. Denote 

F = {\/t=l,...,n |et| < a log'^n}, 

wliere a, d are constants from Assumption 9. Using Assumption 9, the assump- 
tions of this theorem, and Lemma 10, we bound 



hi < PiF") + P 



n 

a' log''' n {Wnt (u) - Wr^t (u) } 

t=i 



> 
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0{n 



P 



Y,Wnt{u)-W^t{i 



> 



4a2 log^'* n 



< 0(n-2) + C3{2M + 1){1 - $(min(e„,6„))} 
+ C4{l-$(min(/„,d„))}. 



(35) 



We now consider I12. Denote p(log'^?i) = 3a log'* n + a. Noting that et ~ St 
at — a(t/n), we obtain 

\Ii2\ < P [-^Y.\'t + StWat ~ a{t/n)\ > - \ 



r> n ( t+M 



q=t-M 



a(t/n) 



^ t+M 1 

q=t-M J 



(t/n)| > 



5' 



S'nb 



4:Kp{log n) 



Using Assumption 9, the assumptions of this theorem, and Lemma 7, we bound 
the above by 



I/12I < 0{n-^) + C3(2M + 1){1 - $(min(5„, 6„))}. 
We now consider l2- We have 

'5' 



(36) 



\I2\ < P{F') + P 



> 



2a2 log^"* n 



Using Assumption 9, the assumptions of this theorem, and Lemma 10, the bound 
is the same as that for In: 

\h\ < 0(?7-2) + C3(2M+l){l-$(min(e„,5„))}+C4{l-$(min(/„,d„))}. (37) 

We finally turn to //. We define 

E7=iWntiu)h{a{t/n)} 



h(u) :=E(/i(w)) 



Note that 



\h{u)-h{u)\ < 



EtiWnt{u)\h{a{t/n)} - h{u)\ 



^ Hj:t=iWntiu)\a{t/n)~u\ ^ bH 
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where the last inequahty comes from the fact that Wntiu) is supported on 
[a{t/n) - 6/2, a{t/n) + 6/2]. We bomid 

\II\ < Pi\h{u) - h{u)\ + \h{u) - h{u)\ > 5/2) < Pi\h{u) - h{u)\ > S/2 - bH/2), 
which, by the assumptions of this theorem and by Lemma 9 is bounded by 

\II\ < C5(l-$(min(/i„,i„))). (38) 

Combining (34), (35), (36), (37) and (38) yields the resuh. □ 

Proof of Corollary 1. Denoting by 0(x) the standard normal pdf, and re- 
calling that for large x, we have 1 — ^{x) < (^(cc). Corollary 1 is a direct conse- 
quence of Theorem 2 and the discussion directly underneath it. □ 



Appendix C: Proof of Theorem 3 

In view of Theorem 5, it suffices to show that our thresholds Xj^k satisfy As- 
sumption 13. The following lemma holds. 

Lemma 11. Suppose that Assumptions 1, 2, 3, 5, 6, 7, 8, 9, 10 and 11 hold. 
There exists a 7„ — > 1_ and a C > (2/i)^/^ such that our random thresholds Xj^k, 
defined in formula (16), satisfy Assumption 13 for all —1 < v < l/(2?Ti -|- 1). 

Proof. We start with (22). To shorten notation, denote u ~ Kj,k~qXq and 
u = J2q i^j,k~qC^iQ/n). We have 

< Aj.j;'') = P{h'^Hu) < Inh'^'iu)} = P{hiu) < -ffM^)} = 
P{h{u) - h{u) > (1 - il)h{u)} < P{\hiu) - hiu)\ > (1 - ^l)h}. 

Suppose 7^ converges to one logarithmically fast in n. Then, by Corollary 1, the 
above probability can uniformly be bounded by 0(n^^). Summing over j, k, we 
obtain 

^ P(A„fc<A5;^/')<0(n-i)<0(n^), 

which shows (22). The technique for showing (23) is exactly the same. We omit 
the details. □ 
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